
52 research outputs found

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Author: Chen Zhengyang
Han Bing
Qian Yanmin
Wang Shuai
Publication date: 12/09/2023

Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we discovered that commonly used simulation datasets for speaker diarization have a much higher overlap ratio compared to real data. We found that using simulated training data that is more consistent with real data can achieve an improvement in consistency. Extensive experimental validation demonstrates the effectiveness of our proposed methodologies. Our best system achieved a new state-of-the-art diarization error rate (DER) on all of the CALLHOME (10.08%), DIHARD II (24.64%), and AMI (13.00%) evaluation benchmarks when no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system also shows remarkable competitiveness as a speech type detection model. Comment: IEEE/ACM Transactions on Audio Speech and Language Processing Under Review
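The two quantities this abstract reports, the diarization error rate and the overlap ratio of a dataset, can be illustrated with a minimal frame-level sketch. It is not the authors' scoring pipeline: it assumes hypothesis speakers are already aligned with reference speakers (no permutation search, no collar), and the function names `frame_der` and `overlap_ratio` are hypothetical. The overlap ratio is taken here as overlapped frames divided by speech frames, which is one of several definitions in use.

```python
def frame_der(ref, hyp):
    """Frame-level DER for binary activity matrices (frames x speakers).

    Assumption: speaker columns in `hyp` are already mapped to `ref`.
    """
    miss = false_alarm = confusion = total_ref = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = sum(r), sum(h)
        n_correct = sum(1 for a, b in zip(r, h) if a and b)
        miss += max(0, n_ref - n_hyp)             # reference speech not detected
        false_alarm += max(0, n_hyp - n_ref)      # detected speech with no reference
        confusion += min(n_ref, n_hyp) - n_correct  # speech given to the wrong speaker
        total_ref += n_ref
    return (miss + false_alarm + confusion) / total_ref


def overlap_ratio(labels):
    """Fraction of speech frames in which two or more speakers are active."""
    speech = sum(1 for frame in labels if sum(frame) >= 1)
    overlap = sum(1 for frame in labels if sum(frame) >= 2)
    return overlap / speech


# Toy example: 4 frames, 2 speakers (made-up labels, not real data).
reference = [[1, 0], [1, 1], [0, 1], [0, 0]]
hypothesis = [[1, 0], [1, 0], [1, 0], [0, 1]]
der = frame_der(reference, hypothesis)
ratio = overlap_ratio(reference)
```

In the toy example one frame contributes a miss, one a speaker confusion, and one a false alarm, against four reference speaker-frames.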

arXiv.org e-Print Archive

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor

Author: Chen Zhengyang
Han Bing
Qian Yanmin
Wang Shuai
Publication date: 01/06/2023

This paper proposes a novel Attention-based Encoder-Decoder network for End-to-End Neural speaker Diarization (AED-EEND). In the AED-EEND system, we incorporate the target speaker enrollment information used in target speaker voice activity detection (TS-VAD) to calculate the attractor, which can mitigate the speaker permutation problem and facilitate easier model convergence. In the training process, we propose a teacher-forcing strategy to obtain the enrollment information using the ground-truth label. Furthermore, we propose three heuristic decoding methods to identify the enrollment area for each speaker during the evaluation process. Additionally, we enhance the LSTM attractor calculation network used in the end-to-end encoder-decoder based attractor calculation (EEND-EDA) system by incorporating an attention-based model. By utilizing such an attention-based attractor decoder, our proposed AED-EEND system outperforms both the EEND-EDA and TS-VAD systems with only 0.5s of enrollment data. Comment: Accepted by InterSpeech 202

arXiv.org e-Print Archive

Estimation of the Relationship Between Remotely Sensed Anthropogenic Heat Discharge and Building Energy Use

Author: Gurney Kevin R.
Hu Xuefei
Shuai Yanmin
Weng Qihao
Zhou Yuyu

This paper examined the relationship between remotely sensed anthropogenic heat discharge and energy use from residential and commercial buildings across multiple scales in the city of Indianapolis, Indiana, USA. The anthropogenic heat discharge was estimated with a remote sensing-based surface energy balance model, which was parameterized using land cover, land surface temperature, albedo, and meteorological data. The building energy use was estimated using a GIS-based building energy simulation model in conjunction with Department of Energy/Energy Information Administration survey data, the Assessor's parcel data, GIS floor areas data, and remote sensing-derived building height data. The spatial patterns of anthropogenic heat discharge and energy use from residential and commercial buildings were analyzed and compared. Quantitative relationships were evaluated across multiple scales from pixel aggregation to census block. The results indicate that anthropogenic heat discharge is consistent with building energy use in terms of the spatial pattern, and that building energy use accounts for a significant fraction of anthropogenic heat discharge. The research also implies that the relationship between anthropogenic heat discharge and building energy use is scale-dependent. The simultaneous estimation of anthropogenic heat discharge and building energy use via two independent methods improves the understanding of the surface energy balance in an urban landscape. The anthropogenic heat discharge derived from remote sensing and meteorological data may be able to serve as a spatial distribution proxy for spatially-resolved building energy use, and even for fossil-fuel CO2 emissions if additional factors are considered

NASA Technical Reports Server

An Approach for the Long-Term 30-m Land Surface Snow-Free Albedo Retrieval from Historic Landsat Surface Reflectance and MODIS-based A Priori Anisotropy Knowledge

Author: Gao Feng
He Tao
Masek Jeffrey G.
Schaaf Crystal B.
Shuai Yanmin

Land surface albedo has been recognized by the Global Terrestrial Observing System (GTOS) as an essential climate variable crucial for accurate modeling and monitoring of the Earth's radiative budget. While global climate studies can leverage albedo datasets from MODIS, VIIRS, and other coarse-resolution sensors, many applications in heterogeneous environments can benefit from higher-resolution albedo products derived from Landsat. We previously developed a "MODIS-concurrent" approach for the 30-meter albedo estimation which relied on combining post-2000 Landsat data with MODIS Bidirectional Reflectance Distribution Function (BRDF) information. Here we present a "pre-MODIS era" approach to extend 30-m surface albedo generation in time back to the 1980s, through an a priori anisotropy Look-Up Table (LUT) built up from the high quality MCD43A BRDF estimates over representative homogeneous regions. Each entry in the LUT reflects a unique combination of land cover, seasonality, terrain information, disturbance age and type, and Landsat optical spectral bands. An initial conceptual LUT was created for the Pacific Northwest (PNW) of the United States and provides BRDF shapes estimated from MODIS observations for undisturbed and disturbed surface types (including recovery trajectories of burned areas and non-fire disturbances). By accepting the assumption of a generally invariant BRDF shape for similar land surface structures as a priori information, spectral white-sky and black-sky albedos are derived through albedo-to-nadir reflectance ratios as a bridge between the Landsat and MODIS scale. A further narrow-to-broadband conversion based on radiative transfer simulations is adopted to produce broadband albedos at visible, near infrared, and shortwave regimes. We evaluate the accuracy of resultant Landsat albedo using available field measurements at forested AmeriFlux stations in the PNW region, and examine the consistency of the surface albedo generated by this approach respectively with that from the "concurrent" approach and the coincident MODIS operational surface albedo products. Using the tower measurements as reference, the derived Landsat 30-m snow-free shortwave broadband albedo yields an absolute accuracy of 0.02 with a root mean square error less than 0.016 and a bias of no more than 0.007. A further cross-comparison over individual scenes shows that the retrieved white sky shortwave albedo from the "pre-MODIS era" LUT approach is highly consistent (R^2 = 0.988, the scene-averaged low RMSE = 0.009 and bias = 0.005) with that generated by the earlier "concurrent" approach. The Landsat albedo also exhibits more detailed landscape texture and a wider dynamic range of albedo values than the coincident 500-m MODIS operational products (MCD43A3), especially in the heterogeneous regions. Collectively, the "pre-MODIS" LUT and "concurrent" approaches provide a practical way to retrieve long-term Landsat albedo from the historic Landsat archives as far back as the 1980s, as well as the current Landsat-8 mission, and thus support investigations into the evolution of the albedo of terrestrial biomes at fine resolution.
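As a rough illustration of the albedo-to-nadir-reflectance ratio and of the RMSE and bias statistics quoted in this abstract, the sketch below scales a fine-resolution reflectance by a coarse-resolution ratio and then compares estimates against reference values. It is a simplification under stated assumptions, not the published retrieval: the function names are hypothetical, the ratio is applied to a single band and pixel, and all numbers are invented for the example.

```python
import math


def albedo_from_ratio(fine_reflectance, coarse_albedo, coarse_nadir_reflectance):
    """Scale a fine-resolution reflectance by the coarse albedo-to-nadir ratio.

    Assumption: the BRDF shape (and hence the ratio) is shared by both scales.
    """
    ratio = coarse_albedo / coarse_nadir_reflectance
    return fine_reflectance * ratio


def rmse_and_bias(estimates, references):
    """Root mean square error and mean bias (estimate minus reference)."""
    diffs = [e - r for e, r in zip(estimates, references)]
    bias = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return rmse, bias


# Invented values for illustration only.
albedo = albedo_from_ratio(0.10, coarse_albedo=0.12, coarse_nadir_reflectance=0.10)
rmse, bias = rmse_and_bias([0.12, 0.15], [0.10, 0.15])
```

A positive bias in this convention means the satellite-derived estimate is higher than the tower reference.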

NASA Technical Reports Server

Use of In Situ and Airborne Multiangle Data to Assess MODIS- and Landsat-based Estimates of Surface Albedo

Author: Gao Feng
Gatebe Charles K.
Masek Jeff
Roman Miguel O.
Schaaf Crystal B.
Shuai Yanmin
Wang Zhuosen

The quantification of uncertainty of global surface albedo data and products is a critical part of producing complete, physically consistent, and decadal land property data records for studying ecosystem change. A current challenge in validating satellite retrievals of surface albedo is the ability to overcome the spatial scaling errors that can contribute on the order of 20% disagreement between satellite and field-measured values. Here, we present the results from an uncertainty analysis of MODerate Resolution Imaging Spectroradiometer (MODIS) and Landsat albedo retrievals, based on collocated comparisons with tower and airborne multi-angular measurements collected at the Atmospheric Radiation Measurement (ARM) Program's Cloud and Radiation Testbed (CART) site during the 2007 Cloud and Land Surface Interaction Campaign (CLASIC'07). Using standard error propagation techniques, airborne measurements obtained by NASA's Cloud Absorption Radiometer (CAR) were used to quantify the uncertainties associated with MODIS and Landsat albedos across a broad range of mixed vegetation and structural types. Initial focus was on evaluating inter-sensor consistency through assessments of temporal stability, as well as examining the overall performance of satellite-derived albedos obtained at all diurnal solar zenith angles. In general, the accuracy of the MODIS and Landsat albedos remained under a 10% margin of error in the SW (0.3 - 5.0 μm) domain. However, results reveal a high degree of variability in the RMSE (root mean square error) and bias of albedos in both the visible (0.3 - 0.7 μm) and near-infrared (0.3 - 5.0 μm) broadband channels; in some cases, retrieval uncertainties were found to be in excess of 20%. For the period of CLASIC'07, the primary factors that contributed to uncertainties in the satellite-derived albedo values include: (1) the assumption of temporal stability in the retrieval of 500 m MODIS BRDF values over extended periods of cloud-contaminated observations; and (2) the assumption of spatial and structural uniformity at the Landsat (30 m) pixel scale.

NASA Technical Reports Server

USED: Universal Speaker Extraction and Diarization

Author: Ao Junyi
Deng Liqun
Ge Meng
Li Haizhou
Qian Yanmin
Tao Ruijie
Wang Shuai
Xiao Longshuai
Yıldırım Mehmet Sinan
Publication date: 19/09/2023

Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talker mixture, while speaker diarization demarcates speech segments by speaker, identifying 'who spoke when'. Previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective, that is, to disentangle the speakers: in the spectral domain for the former and in the temporal domain for the latter. It is logical to believe that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at https://ajyy.github.io/demo/USED/. Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive

Estimation of Crop Gross Primary Production (GPP)

Author: Cheng Yen-Ben
Lyapustin Alexei I.
Middleton Elizabeth M.
Shuai Yanmin
Suyker Andrew
Verma Shashi
Wang Yujie
Zhang Qingyuan
Zhang Xiaoyang

Satellite remote sensing estimates of Gross Primary Production (GPP) have routinely been made using spectral Vegetation Indices (VIs) over the past two decades. The Normalized Difference Vegetation Index (NDVI), the Enhanced Vegetation Index (EVI), the green band Wide Dynamic Range Vegetation Index (WDRVIgreen), and the green band Chlorophyll Index (CIgreen) have been employed to estimate GPP under the assumption that GPP is proportional to the product of VI and photosynthetically active radiation (PAR) (where VI is one of four VIs: NDVI, EVI, WDRVIgreen, or CIgreen). However, the empirical regressions between VI*PAR and GPP measured locally at flux towers do not pass through the origin (i.e., the zero X-Y value for regressions). Therefore they are somewhat difficult to interpret and apply. This study investigates (1) what are the scaling factors and offsets (i.e., regression slopes and intercepts) between the fraction of PAR absorbed by chlorophyll of a canopy (fAPARchl) and the VIs, and (2) whether the scaled VIs developed in (1) can eliminate the deficiency and improve the accuracy of GPP estimates. Three AmeriFlux maize and soybean fields were selected for this study, two of which are irrigated and one of which is rainfed. The four VIs and fAPARchl of the fields were computed with the MODerate resolution Imaging Spectroradiometer (MODIS) satellite images. The GPP estimation performance for the scaled VIs was compared to results obtained with the original VIs and evaluated with standard statistics: the coefficient of determination (R^2), the root mean square error (RMSE), and the coefficient of variation (CV). Overall, the scaled EVI obtained the best performance. The performance of the scaled NDVI, EVI and WDRVIgreen was improved across sites, crop types and soil/background wetness conditions. The scaled CIgreen did not improve results, compared to the original CIgreen. The scaled green band indices (WDRVIgreen, CIgreen) did not exhibit superior performance to either the scaled EVI or NDVI in estimating crop daily GPP at these agricultural fields. The scaled VIs are more physiologically meaningful than original un-scaled VIs, but scaling factors and offsets may vary across crop types and surface conditions.
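The idea of a "scaled VI" described in this abstract can be sketched as an ordinary least-squares fit of fAPARchl against a vegetation index, with the fitted slope and intercept then applied to the index. The sketch uses the standard NDVI definition, (NIR - red) / (NIR + red); everything else is an assumption for illustration: the function names are hypothetical, the sample values are invented, and the study's actual fitting procedure and data are not reproduced here.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scaled_vi(vi, slope, offset):
    """Apply the scaling factor and offset so the index approximates fAPARchl."""
    return slope * vi + offset


# Invented sample values: a vegetation index and matching fAPARchl estimates.
vi_samples = [0.2, 0.4, 0.6]
fapar_chl = [0.1, 0.3, 0.5]
slope, offset = fit_line(vi_samples, fapar_chl)

# Under the proportionality assumption, GPP would then scale with scaled VI * PAR.
par = 40.0  # hypothetical daily PAR value, units left unspecified
gpp_proxy = scaled_vi(0.5, slope, offset) * par
```

Because the offset absorbs the non-zero intercept, a regression of GPP on the scaled product is expected to pass closer to the origin than one on the raw VI*PAR product.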

NASA Technical Reports Server

Early Spring Post-Fire Snow Albedo Dynamics in High Latitude Boreal Forests Using Landsat-8 OLI Data

Author: Casey Kimberly A.
Erb Angela M.
Liu Yan
Roman Miguel O.
Schaaf Crystal B.
Shuai Yanmin
Sun Qingsong
Wang Zhuosen
Yang Yun

Taking advantage of the improved radiometric resolution of Landsat-8 OLI which, unlike previous Landsat sensors, does not saturate over snow, the progress of fire recovery at the landscape scale (less than 100 m) is examined. High quality Landsat-8 albedo retrievals can now capture the true reflective and layered character of snow cover over a full range of land surface conditions and vegetation densities. This new capability particularly improves the assessment of post-fire vegetation dynamics across low- to high-burn severity gradients in Arctic and boreal regions in the early spring, when the albedos during recovery show the greatest variation. We use 30 m resolution Landsat-8 surface reflectances with concurrent coarser resolution (500 m) MODIS high quality full inversion surface Bidirectional Reflectance Distribution Function (BRDF) products to produce higher resolution values of surface albedo. The high resolution full expression shortwave blue sky albedo product performs well with an overall RMSE of 0.0267 between tower and satellite measures under both snow-free and snow-covered conditions. While the importance of post-fire albedo recovery can be discerned from the MODIS albedo product at regional and global scales, our study addresses the particular importance of early spring post-fire albedo recovery at the landscape scale by considering the significant spatial heterogeneity of burn severity, and the impact of snow on the early spring albedo of various vegetation recovery types. We found that variations in early spring albedo within a single MODIS gridded pixel can be larger than 0.6. Since the frequency and severity of wildfires in Arctic and boreal systems are expected to increase in the coming decades, the dynamics of albedo in response to these rapid surface changes will increasingly impact the energy balance and contribute to other climate processes and physical feedback mechanisms. Surface radiation products derived from Landsat-8 data will thus play an important role in characterizing the carbon cycle and ecosystem processes of high latitude systems.

NASA Technical Reports Server